Punctuation Prediction for Unsegmented Transcript Based on Word Vector

نویسندگان

  • Xiaoyin Che
  • Cheng Wang
  • Haojin Yang
  • Christoph Meinel
چکیده

In this paper we propose an approach to predict punctuation marks for unsegmented speech transcript. The approach is purely lexical, with pre-trained Word Vectors as the only input. A training model of Deep Neural Network (DNN) or Convolutional Neural Network (CNN) is applied to classify whether a punctuation mark should be inserted after the third word of a 5-words sequence and which kind of punctuation mark the inserted one should be. TED talks within IWSLT dataset are used in both training and evaluation phases. The proposed approach shows its effectiveness by achieving better result than the state-of-the-art lexical solution which works with same type of data, especially when predicting puncuation position only.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Segmentation-Free Word Embedding for Unsegmented Languages

In this paper, we propose a new pipeline of word embedding for unsegmented languages, called segmentation-free word embedding, which does not require word segmentation as a preprocessing step. Unlike space-delimited languages, unsegmented languages, such as Chinese and Japanese, require word segmentation as a preprocessing step. However, word segmentation, that often requires manually annotated...

متن کامل

Better Punctuation Prediction with Hierarchical Phrase-Based Translation

Punctuation prediction is an important task in spoken language translation and can be performed by using a monolingual phrase-based translation system to translate from unpunctuated to text with punctuation. However, a punctuation prediction system based on phrase-based translation is not able to capture long-range dependencies between words and punctuation marks. In this paper, we propose to e...

متن کامل

Comparative Study on Sentence Boundary Prediction for German and English Broadcast News

We present a comparative study on sentence boundary prediction for German and English broadcast news that explores generalization across different languages. In the feature extraction stage, word pause duration is firstly extracted from word aligned speech, and forward and backward language models are utilized to extract textual features. Then a gradient boosted machine is optimized by grid sea...

متن کامل

Prototype Extraction and Adaptive OCR

ÐTo maintain OCR accuracy with decreasing quality of page image composition, production, and digitization, it is essential to tune the system to each document. We propose a prototype extraction method for document-specific OCR systems. The method automatically generates training samples from unsegmented text images and the corresponding transcripts. It is tolerant of transcription errors, so a ...

متن کامل

Punctuation Prediction using Linear Chain Conditional Random Fields

We investigate the task of punctuation prediction in English sentences without prosodic information. In our approach, stochastic gradient ascent (SGA) is used to maximize log conditional likelihood when learning the parameters of linear-chain conditional random fields. For SGA, two different approximation techniques, namely Collins perceptron and contrastive divergence, are used to estimate the...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016